AITopics

Country:

North America > United States (0.29)
North America > Canada (0.04)

Genre: Research Report > New Finding (0.47)

Industry:

Information Technology (0.68)
Government (0.47)
Semiconductors & Electronics (0.46)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Neural Information Processing SystemsFeb-15-2026, 07:07:43 GMT

faf02b2358de8933f480a146f4d2d98e-AuthorFeedback.pdf

eigenspace overlap score, reconstruction error, uniform quantization, (12 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language (0.53)

arXiv.org Artificial IntelligenceNov-26-2025

CafeQ: Calibration-free Quantization via Learned Transformations and Adaptive Rounding

Sun, Ziteng, Benton, Adrian, Kushnir, Samuel, Trockman, Asher, Singh, Vikas, Diggavi, Suhas, Suresh, Ananda Theertha

Post-training quantization is an effective method for reducing the serving cost of large language models, where the standard approach is to use a round-to-nearest quantization level scheme. However, this often introduces large errors due to outliers in the weights. Proposed mitigation mechanisms include applying adaptive rounding, random rotation transformations or committing to a post-training target using calibration data. Unfortunately, this reliance on calibration data can be severely limiting in some real-world scenarios as such data may be unavailable or subject to privacy regulations. In this paper, we propose algorithms to optimize transformations and adaptive rounding without access to any calibration data. The optimization is achieved by designing a suitable proxy function for the quantization loss without calibration data. To maintain inference efficiency, we perform structured matrix transformations for single matrices. For paired weights that interact directly in the computation graph, we use dual matrix transformations and adaptive rounding methods. We conduct experiments on Gemma 2 models, and observe consistent improvement over the baselines. For Gemma 2 9B quantization, our method improves the average benchmark score from 61.9 to 62.4 for 4-bit quantization and from 52.0 to 60.6 for 3-bit quantization, while adding less than 3% of computation overhead. Furthermore, our method achieves performance comparable to the commonly used GPTQ method, which requires calibration data.

large language model, machine learning, quantization, (16 more...)

2511.19705

Country:

North America > United States > Minnesota (0.28)
North America > United States > California (0.28)

Genre: Research Report (0.83)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Data Science (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Dool, Winfried van den, Zhdanov, Maksim, Asano, Yuki M., Welling, Max

Adaptive Mesh-Quantization for Neural PDE Solvers

arXiv.org Artificial IntelligenceNov-25-2025

Physical systems commonly exhibit spatially varying complexity, presenting a significant challenge for neural PDE solvers. While Graph Neural Networks can handle the irregular meshes required for complex geometries and boundary conditions, they still apply uniform computational effort across all nodes regardless of the underlying physics complexity. This leads to inefficient resource allocation where computationally simple regions receive the same treatment as complex phenomena. We address this challenge by introducing Adaptive Mesh Quantization: spatially adaptive quantization across mesh node, edge, and cluster features, dynamically adjusting the bit-width used by a quantized model. We propose an adaptive bit-width allocation strategy driven by a lightweight auxiliary model that identifies high-loss regions in the input mesh. This enables dynamic resource distribution in the main model, where regions of higher difficulty are allocated increased bit-width, optimizing computational resource utilization. We demonstrate our framework's effectiveness by integrating it with two state-of-the-art models, MP-PDE and GraphViT, to evaluate performance across multiple tasks: 2D Darcy flow, large-scale unsteady fluid dynamics in 2D, steady-state Navier-Stokes simulations in 3D, and a 2D hyper-elasticity problem. Our framework demonstrates consistent Pareto improvements over uniformly quantized baselines, yielding up to 50% improvements in performance at the same cost.

artificial intelligence, machine learning, quantization, (19 more...)

2511.18474

Country: Europe > Netherlands (0.14)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Tepper, Mariano, Willke, Ted

Individualized non-uniform quantization for vector search

arXiv.org Artificial IntelligenceSep-24-2025

Embedding vectors are widely used for representing unstructured data and searching through it for semantically similar items. However, the large size of these vectors, due to their high-dimensionality, creates problems for modern vector search techniques: retrieving large vectors from memory/storage is expensive and their footprint is costly. In this work, we present NVQ (non-uniform vector quantization), a new vector compression technique that is computationally and spatially efficient in the high-fidelity regime. The core in NVQ is to use novel parsimonious and computationally efficient nonlinearities for building non-uniform vector quantizers. Critically, these quantizers are \emph{individually} learned for each indexed vector. Our experimental results show that NVQ exhibits improved accuracy compared to the state of the art with a minimal computational cost.

information retrieval, large language model, machine learning, (19 more...)

2509.18471

Country: North America > United States > Minnesota (0.28)

Genre: Research Report (0.84)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.67)
(2 more...)

Neural Information Processing SystemsAug-20-2025, 10:31:22 GMT

On the Downstream Performance of Compressed Word Embeddings

Avner May, Jian Zhang, Tri Dao, Christopher Ré

Neural Information Processing Systems http://nips.cc/

compression quality, downstream performance, eigenspace overlap score, (13 more...)

Country:

North America > United States > California > Santa Clara County > Palo Alto (0.04)
North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Genre: Research Report > New Finding (0.47)

Industry:

Information Technology (0.68)
Government > Regional Government > North America Government > United States Government (0.47)
Semiconductors & Electronics (0.46)

Technology:

Information Technology > Communications (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.93)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Neural Information Processing SystemsAug-20-2025, 10:31:07 GMT

R3 and R5 asked, respectively, (1) whether we are claiming that uniform quantization is strictly better than the other

We thank all the reviewers for their thoughtful feedback. We will clarify these points. R2 and R3 had concerns about the amount of content we deferred to the appendix. In Appendix B.4, we discuss a variant of the embedding reconstruction error applicable to R2 asked about our question answering results in Section 2.3. We use the DrQA model [5], as described in Section 4. R3 proposed an idea to use non-uniform quantization to further improve the performance of quantized embeddings.

eigenspace overlap score, reconstruction error, uniform quantization, (12 more...)

Technology: Information Technology > Artificial Intelligence > Natural Language (0.53)

arXiv.org Artificial IntelligenceJul-17-2025

PoTPTQ: A Two-step Power-of-Two Post-training for LLMs

Wang, Xinyu, Nia, Vahid Partovi, Lu, Peng, Huang, Jerry, Chang, Xiao-Wen, Chen, Boxing, Cui, Yufei

Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, their deployment is challenging due to the substantial computational resources required. Power-of-two (PoT) quantization is a general tool to counteract this difficulty. Albeit previous works on PoT quantization can be efficiently dequantized on CPUs using fixed-point addition, it showed less effectiveness on GPUs. The reason is entanglement of the sign bit and sequential bit manipulations needed for dequantization. We propose a novel POT quantization framework for LLM weights that (i) outperforms state-of-the-art accuracy in extremely low-precision number formats, and (ii) enables faster inference through more efficient dequantization. To maintain the accuracy of the quantized model, we introduce a two-step post-training algorithm: (i) initialize the quantization scales with a robust starting point, and (ii) refine these scales using a minimal calibration set. The performance of our PoT post-training algorithm surpasses the current state-of-the-art in integer quantization, particularly at low precisions such as 2- and 3-bit formats. Our PoT quantization accelerates the dequantization step required for the floating point inference and leads to $3.67\times$ speed up on a NVIDIA V100, and $1.63\times$ on a NVIDIA RTX 4090, compared to uniform integer dequantization.

large language model, machine learning, quantization, (19 more...)

2507.11959

Country: North America > Canada > Quebec (0.28)

Genre: Research Report (0.82)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.31)

arXiv.org Artificial IntelligenceMay-28-2025

NeUQI: Near-Optimal Uniform Quantization Parameter Initialization

Lin, Li, Hu, Xinyu, Wan, Xiaojun

Large language models (LLMs) achieve impressive performance across domains but face significant challenges when deployed on consumer-grade GPUs or personal devices such as laptops, due to high memory consumption and inference costs. Post-training quantization (PTQ) of LLMs offers a promising solution that reduces their memory footprint and decoding latency. In practice, PTQ with uniform quantization representation is favored for its efficiency and ease of deployment since uniform quantization is widely supported by mainstream hardware and software libraries. Recent studies on $\geq 2$-bit uniform quantization have led to noticeable improvements in post-quantization model performance; however, they primarily focus on quantization methodologies, while the initialization of quantization parameters is underexplored and still relies on the suboptimal Min-Max strategies. In this work, we propose NeUQI, a method devoted to efficiently determining near-optimal initial parameters for uniform quantization. NeUQI is orthogonal to prior quantization methodologies and can seamlessly integrate with them. The experiments with the LLaMA and Qwen families on various tasks demonstrate that our NeUQI consistently outperforms existing methods. Furthermore, when combined with a lightweight distillation strategy, NeUQI can achieve superior performance to PV-tuning, a much more resource-intensive approach.

large language model, machine learning, quantization, (19 more...)

2505.17595

Country: North America > United States > California (0.28)

Genre: Research Report > Promising Solution (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.69)

Zhang, Wenlun, Ando, Shimpei, Yoshioka, Kentaro

AHCPTQ: Accurate and Hardware-Compatible Post-Training Quantization for Segment Anything Model

arXiv.org Artificial IntelligenceMar-4-2025

The Segment Anything Model (SAM) has demonstrated strong versatility across various visual tasks. However, its large storage requirements and high computational cost pose challenges for practical deployment. Post-training quantization (PTQ) has emerged as an effective strategy for efficient deployment, but we identify two key challenges in SAM that hinder the effectiveness of existing PTQ methods: the heavy-tailed and skewed distribution of post-GELU activations, and significant inter-channel variation in linear projection activations. To address these challenges, we propose AHCPTQ, an accurate and hardware-efficient PTQ method for SAM. AHCPTQ introduces hardware-compatible Hybrid Log-Uniform Quantization (HLUQ) to manage post-GELU activations, employing log2 quantization for dense small values and uniform quantization for sparse large values to enhance quantization resolution. Additionally, AHCPTQ incorporates Channel-Aware Grouping (CAG) to mitigate inter-channel variation by progressively clustering activation channels with similar distributions, enabling them to share quantization parameters and improving hardware efficiency. The combination of HLUQ and CAG not only enhances quantization effectiveness but also ensures compatibility with efficient hardware execution. For instance, under the W4A4 configuration on the SAM-L model, AHCPTQ achieves 36.6% mAP on instance segmentation with the DINO detector, while achieving a 7.89x speedup and 8.64x energy efficiency over its floating-point counterpart in FPGA implementation.

activation, quantization, quantization parameter, (14 more...)

2503.03088

Country:

Europe > Switzerland > Zürich > Zürich (0.14)
Asia > Myanmar > Tanintharyi Region > Dawei (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Vision (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.68)